Prosper Loan Data Set

by Ahmed Khaled Mohamed

Preliminary Wrangling

This is the Prosper loan data set, which contains information about loan listings and related variables for both borrowers and lenders. Borrower variables include credit rating, Prosper rating, and similar fields. The dataset also includes lender information.

1.1. Explore the dataset

1.2. Check for the misleading rows

There are no duplicate rows in the dataframe.
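A duplicate check of this kind can be sketched with pandas as follows. The helper name and the tiny sample frame are illustrative assumptions (the values are made up, not Prosper data), and `ListingKey` is assumed here to be an identifier column.

```python
import pandas as pd


def count_duplicate_rows(df: pd.DataFrame) -> int:
    """Return the number of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


# Made-up values for illustration only (hypothetical sample, not Prosper data).
sample = pd.DataFrame(
    {"ListingKey": ["a", "b", "b"], "BorrowerAPR": [0.12, 0.20, 0.20]}
)
duplicate_count = count_duplicate_rows(sample)
deduplicated = sample.drop_duplicates()
```

A count of zero on the full dataframe would support the statement above.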

1.3. Clean the data

What is the structure of your dataset?

The Prosper loan dataset contains 113,937 observations of 81 variables. The observations refer to loan listings on Prosper.com from late 2005 until 2014, and various characteristics of those loans. The data seems “tidy,” according to Hadley Wickham’s definition: the variable names are not variables themselves, so there is not much work required in the way of “tidying” the data.

What is/are the main feature(s) of interest in your dataset?

The histogram of Loan Origination Quarter shows a big dip in listings from Q4 2008 into 2009-10. This time period coincides with the (A) collapse of Lehman Brothers and the ensuing fallout in the global financial system, and (B) Prosper’s decision to register with the SEC. It appears that some combination of A and B caused Prosper to change how it does business. It will be interesting to take a look at how Prosper’s business changed over time.

I'm most interested in figuring out which features are best for predicting the borrower's Annual Percentage Rate (APR) for the loan.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

My main investigation will be into how Prosper’s credit policies and borrower characteristics changed over time as its business model and the larger lending environment evolved. So having the Loan Origination Quarter is critical. And being able to see how much Prosper’s lenders earned or lost on each loan with the Payments, Fees, and Loss fields could prove helpful in illuminating what is going on with these loans.

I expect that the total loan amount will have a negative effect on the APR of the loan: the larger the total loan amount, the lower the APR. I also think that the borrower's stated monthly income, loan term, Prosper rating, and employment status will affect the APR.

Univariate Exploration


2.1. BorrowerAPR Distribution

Interest rates are pretty high, averaging 19.28%. There are also material spikes at interest rates over 30%. Since most of these loans are for debt consolidation, these high rates are presumably still better than the rates the borrowers would have to pay to credit card companies. Or perhaps the borrowers see value in freeing up the borrowing limit on their credit cards, even if it means paying high rates to Prosper lenders.

2.2. Loan Original Amount Distribution

The loan amounts skew relatively small, with the median of 6,500 less than the mean of 8,337. Based on the past two histograms, a typical loan is for consolidating less than 10,000 of debt (probably credit card debt).
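The mean-versus-median comparison above is a quick way to read skew: a mean well above the median indicates a long right tail. A minimal sketch, using a hypothetical helper and made-up amounts rather than the real loan column:

```python
import pandas as pd


def mean_median_gap(values: pd.Series) -> float:
    """Return mean minus median; a positive gap suggests right skew."""
    return float(values.mean() - values.median())


# Made-up loan amounts for illustration only.
amounts = pd.Series([2000, 4000, 6500, 10000, 25000])
gap = mean_median_gap(amounts)
```

Applied to the original loan amounts, the gap would be the 8,337 mean minus the 6,500 median reported above.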

2.3. On Time Prosper Payments Distribution

OnTimeProsperPayments values range from 0 to about 70.

2.4. Monthly Loan Payment Distribution

Most MonthlyLoanPayment values fall between 100 and 700, with the peak around 200.

2.5. Debt To Income Ratio Distribution

The vast majority of Debt-to-Income Ratios are less than 0.5. Excluding the outliers on the high side on the right, the data is close to having the shape of a normal distribution, though it is skewed slightly left.

2.6. Loan Origination Quarter Distribution

The dip in originations from Q4 2008 through 2009 coincides with the collapse of Lehman Brothers and the subsequent fallout in the global financial system. It took almost four years before the listing rate returned to the levels of Q2 2008. Although Prosper is an alternative to conventional loan models, its business appears not to have been immune to the global economic crisis. I wonder now whether this economic crisis induced Prosper to alter the way it does business. Perhaps Prosper's credit policies were much looser before the financial crisis? I will set this question aside for later. It appears, after all, that Prosper introduced its Prosper Rating and Prosper Score only in July 2009.
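Counting loans per quarter in chronological order takes a small parsing step, because the quarter labels sort alphabetically by default. The sketch below assumes labels of the form "Q4 2008" (quarter first, then year); the helper name and sample labels are hypothetical.

```python
import pandas as pd


def loans_per_quarter(quarters: pd.Series) -> pd.Series:
    """Count loans per origination quarter, ordered chronologically."""
    counts = quarters.value_counts()

    def chronological_key(label: str) -> tuple:
        # Assumes labels look like "Q4 2008": quarter first, then year.
        quarter, year = label.split()
        return int(year), int(quarter[1:])

    return counts.reindex(sorted(counts.index, key=chronological_key))


# Made-up labels for illustration only.
sample_quarters = pd.Series(["Q4 2008", "Q1 2009", "Q4 2008", "Q2 2008"])
quarter_counts = loans_per_quarter(sample_quarters)
```

The resulting series can be passed straight to a bar plot so that a dip like the one described above appears in time order.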

2.7. Term Distribution

Loan terms are either 12, 36, or 60 months, with the vast majority being 36 months.

2.8. Credit Grade Distribution

The most frequent credit grades in the data are C and D. The other grades have roughly similar counts, and NC is the least common category.

2.9. Loan Status Distribution

Overall, it appears that a large majority of loans are either Completed or Current, though there are also a large number of Charged-off and Defaulted (non-performing). A little later I will want to look at loan performance based on origination vintage.

2.10. Prosper Score Distribution

This field is more or less normally distributed. We can conclude that, overall, risk ratings are relatively normal across the sample.

2.11. Borrower State Distribution

The most common borrower state is CA.

2.12. Income Verifiable Distribution

Most values of IncomeVerifiable are True.

2.13. Is Borrower Homeowner Distribution

IsBorrowerHomeowner is split roughly evenly between True and False.

2.14. Occupation Distribution

The majority put “Other” for their occupation. That is not too helpful. The largest non-Other category is “Professional,” which is another unhelpful, generic catch-all. But perhaps we can compare the loan performance of some of the other borrower professions down the road.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

From the previous analysis I found that:

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of stated monthly income is highly right-skewed. Most stated monthly incomes are less than 30k, but some are incredibly high, greater than 100k. Surprisingly, most borrowers with a monthly income above 100k borrowed less than 5k dollars, so the very large stated incomes may be made up. Overall, less than 0.3 percent of borrowers have a stated monthly income greater than 30k. These can be treated as outliers for the following analysis, so it is better to remove borrower records with income greater than 30k.
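The outlier rule described here (drop records with stated monthly income above 30k) could be sketched as below. The column name `StatedMonthlyIncome` and both helper names are assumptions for illustration, and the sample values are made up.

```python
import pandas as pd


def share_above(df: pd.DataFrame, cap: float = 30_000.0,
                column: str = "StatedMonthlyIncome") -> float:
    """Return the fraction of rows whose income exceeds the cap."""
    return float((df[column] > cap).mean())


def drop_income_outliers(df: pd.DataFrame, cap: float = 30_000.0,
                         column: str = "StatedMonthlyIncome") -> pd.DataFrame:
    """Keep rows at or below the cap and return a copy."""
    # Rows with a missing income compare as False and are dropped as well.
    return df[df[column] <= cap].copy()


# Made-up incomes for illustration only.
sample = pd.DataFrame({"StatedMonthlyIncome": [2500.0, 4000.0, 30000.0, 120000.0]})
outlier_share = share_above(sample)
trimmed = drop_income_outliers(sample)
```

Checking the share first makes it easy to confirm that the cap removes only a small fraction of records before the filter is applied.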

The majority of loans are current loans. Since our main goal is to identify the factors that drive loan outcomes, we are not interested in current loans or in loans with a specified past-due period. Charged-off loans can also be treated as defaulted.
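One way to express this filtering is sketched below. It assumes a `LoanStatus` column whose finished-loan labels are spelled "Completed", "Defaulted", and "Chargedoff"; the exact spellings and the helper name are assumptions, and the sample statuses are made up.

```python
import pandas as pd

# Assumed label spellings; adjust to match the actual column values.
FINISHED_STATUSES = ["Completed", "Defaulted", "Chargedoff"]


def keep_finished_loans(df: pd.DataFrame,
                        status_col: str = "LoanStatus") -> pd.DataFrame:
    """Keep only finished loans and fold charged-off loans into Defaulted."""
    finished = df[df[status_col].isin(FINISHED_STATUSES)].copy()
    finished[status_col] = finished[status_col].replace({"Chargedoff": "Defaulted"})
    return finished


# Made-up statuses for illustration only.
sample = pd.DataFrame(
    {"LoanStatus": ["Current", "Completed", "Chargedoff",
                    "Past Due (1-15 days)", "Defaulted"]}
)
finished_loans = keep_finished_loans(sample)
```

After this step the status column has only two values, which simplifies any later comparison of completed and defaulted loans.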

Bivariate Exploration


3.1. BorrowerAPR

3.1.1. BorrowerAPR with Borrower credit information

As the credit grade worsens, the borrower APR increases and the spread of its distribution narrows. As the Prosper score increases, the borrower APR decreases.
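A relationship like this can be checked numerically by grouping the APR by the rating variable. The sketch below uses a hypothetical helper and made-up values; `BorrowerAPR` and `ProsperScore` are used as column names on the assumption that they match the dataset.

```python
import pandas as pd


def median_apr_by(df: pd.DataFrame, group_col: str,
                  apr_col: str = "BorrowerAPR") -> pd.Series:
    """Return the median APR for each value of group_col, sorted by group."""
    return df.groupby(group_col)[apr_col].median().sort_index()


# Made-up scores and rates for illustration only.
sample = pd.DataFrame(
    {
        "ProsperScore": [1, 1, 5, 5, 10],
        "BorrowerAPR": [0.35, 0.33, 0.20, 0.22, 0.08],
    }
)
apr_by_score = median_apr_by(sample, "ProsperScore")
```

A series that falls steadily as the score rises would match the pattern described above.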

3.1.2. BorrowerAPR with Borrower personal information

This analysis is not very informative, except that borrowers who are not homeowners have a slightly higher borrower APR.

3.1.3. BorrowerAPR with Borrower records information

As the number of public records and delinquencies increases, the borrower APR decreases slightly, but the relationship is too weak to be informative.

3.1.4. BorrowerAPR with lender information

As the number of investors increases the Borrower APR decreases.

3.1.5. BorrowerAPR with payment information

There is no additional information in this analysis.

3.2. ProsperScore

3.2.1. ProsperScore with Borrower credit information

3.2.2. ProsperScore with Borrower personal information

3.2.3. ProsperScore with payment information

3.3. Investors

3.3.1. Investors with Borrower credit information

As the credit grade worsens, the number of investors decreases, and as the Prosper score increases, the number of investors increases.

3.3.2. Investors with Borrower personal information

The number of investors is highest for borrowers with Full-time employment status and for borrowers in the lowest range of debt-to-income ratio.

3.3.3. Investors with payment information

The shorter the time in payment, the higher the number of investors.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration


4.1. payment correlation

Interestingly there is no strong correlation between variables in this data set. There is some moderate positive and moderate negative correlation.
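A claim like this can be checked by listing every pair of numeric columns whose Pearson correlation exceeds a chosen threshold. The helper name, the 0.7 cutoff, and the sample columns below are all illustrative assumptions, with made-up values.

```python
import pandas as pd


def strong_pairs(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return (col_a, col_b, r) for numeric pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = corr.loc[first, second]
            # NaN correlations fail this comparison and are skipped.
            if abs(r) >= threshold:
                pairs.append((first, second, float(r)))
    return pairs


# Made-up values: the second column is exactly twice the first.
sample = pd.DataFrame(
    {
        "LoanOriginalAmount": [1, 2, 3, 4],
        "MonthlyLoanPayment": [2, 4, 6, 8],
        "Unrelated": [1, -1, -1, 1],
    }
)
pairs = strong_pairs(sample)
```

An empty list at the chosen threshold would be consistent with the observation that no pair of variables is strongly correlated.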

4.2. BorrowerAPR in terms of loan amount and duration

Term doesn't seem to have an effect on the relationship between APR and loan amount.

4.3. BorrowerAPR in terms of loan amount and Prosper rating

The loan amount increases with better ratings, and the borrower APR decreases with better ratings. Interestingly, the relationship between borrower APR and loan amount turns from negative to slightly positive as the Prosper rating improves from HR to A or better. This may be because people with A or AA ratings tend to borrow more money, so increasing the APR could keep them from borrowing even more and maximize profit. People with lower ratings tend to borrow less, so decreasing the APR could encourage them to borrow more.
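The sign change can be checked by computing the APR-amount correlation separately within each rating. In this sketch the column names `ProsperRating (Alpha)` and `LoanOriginalAmount` are assumptions about the dataset, the helper is hypothetical, and the values are made up so that the two groups have opposite signs.

```python
import pandas as pd


def apr_amount_corr_by_rating(
    df: pd.DataFrame,
    rating_col: str = "ProsperRating (Alpha)",
    apr_col: str = "BorrowerAPR",
    amount_col: str = "LoanOriginalAmount",
) -> pd.Series:
    """Return the Pearson correlation of APR and amount within each rating."""
    result = {
        rating: group[apr_col].corr(group[amount_col])
        for rating, group in df.groupby(rating_col)
    }
    return pd.Series(result)


# Made-up values for illustration only.
sample = pd.DataFrame(
    {
        "ProsperRating (Alpha)": ["HR", "HR", "HR", "AA", "AA", "AA"],
        "LoanOriginalAmount": [1000, 2000, 3000, 5000, 10000, 15000],
        "BorrowerAPR": [0.35, 0.33, 0.31, 0.07, 0.08, 0.09],
    }
)
corr_by_rating = apr_amount_corr_by_rating(sample)
```

On the real data, negative values for the lower ratings and positive values for the top ratings would confirm the pattern seen in the plots.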

4.4. ProsperScore by loan duration

Interestingly, the borrower APR decreases with increasing Prosper score. But for people with scores of 4.0 and 7.0, the APR increases as the loan term decreases. The debt-to-income ratio decreases with increasing Prosper score but is nearly the same for scores 2.0, 3.0, 4.0, and 5.0. The number of investors increases with increasing Prosper score. The original loan amount increases with term and with Prosper score.

4.5. Loan Origination Quarter

4.6. LoanOriginationQuarter for each ProsperScore

The number of loans per origination quarter increases over the years, as does the loan duration. The higher Prosper scores are nearly absent in the earlier years, from 2009 to 2011.

4.7. BorrowerAPR and EmploymentStatus for loan terms

Across all loan terms, the most common employment status is Employed, followed by Full-time.

4.8. ProsperScore and EmploymentStatus

Middle ratings seem to have greater proportions of individuals with employment status Not Employed, Self-employed, Retired, and Part-Time.
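Proportions like these can be tabulated with a row-normalized cross-tabulation, so that each Prosper score's row sums to 1. The helper name is hypothetical and the sample values are made up.

```python
import pandas as pd


def status_share_by_score(
    df: pd.DataFrame,
    score_col: str = "ProsperScore",
    status_col: str = "EmploymentStatus",
) -> pd.DataFrame:
    """Return, for each score, the share of borrowers in each status."""
    return pd.crosstab(df[score_col], df[status_col], normalize="index")


# Made-up values for illustration only.
sample = pd.DataFrame(
    {
        "ProsperScore": [1, 1, 5, 5],
        "EmploymentStatus": ["Employed", "Retired", "Employed", "Employed"],
    }
)
shares = status_share_by_score(sample)
```

Normalizing by row rather than using raw counts keeps the comparison fair when some scores have far more loans than others.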

4.9. BorrowerAPR and ProsperRating for loan status

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Our initial assumptions were strengthened. The outcome of a loan depends on Prosper rating, term, and employment status. Defaulted loans tend to be larger than completed ones for all ratings except the lowest. In terms of loan purpose, the “other” and “business” categories are more prone to default (the business category also tends to have larger loans). Long-term (60-month) loans are riskier than mid-term and short-term loans.

I extended my investigation of borrower APR against loan amount by looking at the impact of the Prosper rating. The multivariate exploration showed that the relationship between borrower APR and loan amount turns from negative to slightly positive as the Prosper rating improves from HR to AA. I then explored the effects of rating and term on loan amount. With better Prosper ratings, the loan amount increases for all three terms, and the difference in loan amount between terms also grows.

Were there any interesting or surprising interactions between features?

A surprising interaction is that borrower APR and loan amount are negatively correlated when the Prosper rating is between HR and B, but the correlation turns positive when the rating is A or AA.

An interesting finding was that defaulted loans for individuals with high Prosper ratings tend to be larger than completed loans. Another is that individuals with the lowest rating (HR) have only mid-term (36-month) loans.
